Twenty Questions Games Always End With Yes 

John T. Gill III, Member, IEEE, and William Wu 

Abstract — Huffman coding is often presented as the optimal solution 
to Twenty Questions. However, a caveat is that Twenty Questions games 
always end with a reply of "Yes," whereas Huffman codewords need not 
obey this constraint. We bring resolution to this issue, and prove that the 
average number of questions still lies between H(X) and H(X) + 1. 

Index Terms — Huffman coding, entropy, twenty questions game, Gal- 
lager's redundancy bound 

I. Introduction 

Twenty Questions is a classic parlour game involving an answerer 
and a questioner. The questioner must guess what object the answerer 
is thinking of, but is only allowed to ask questions whose answers 
are either "Yes" or "No". Popular initial questions include: "Is it an 
animal? Is it a vegetable? Is it a mineral?" The name of the game 
arises from the fact that if one bit of information could be acquired 
from each question, then twenty questions can distinguish between 
$2^{20}$ different objects, which should be more than sufficient. 

Courses in information theory often cast Huffman coding as the 
optimal approach to Twenty Questions. Given the set of possible 
objects and their probabilities, the questioner associates a Huffman 
codeword with each object, and then inquires about each bit of the 
codeword of the object that the answerer is thinking of. The average number of 
questions is the Huffman tree's average depth, which is no less than 
H(X), and less than H(X) + 1, where X is the random variable 
indicating which of n objects the answerer is thinking of. 
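
The classical bounds can be checked numerically. The following Python sketch builds Huffman codeword lengths with a priority queue and compares the average depth with the entropy. The function names and the example distribution are illustrative assumptions, not part of the original presentation.

```python
import heapq
import math


def entropy(probs):
    """Shannon entropy H(X) in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def huffman_lengths(probs):
    """Return the Huffman codeword length of each symbol (assumes at least two symbols)."""
    # Heap entries: (subtree probability, unique tie-breaker, symbols in the subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)
    while len(heap) > 1:
        p_a, _, syms_a = heapq.heappop(heap)
        p_b, _, syms_b = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        for s in syms_a + syms_b:
            lengths[s] += 1
        heapq.heappush(heap, (p_a + p_b, counter, syms_a + syms_b))
        counter += 1
    return lengths


def average_depth(probs, lengths):
    """Probability-weighted codeword length."""
    return sum(p * l for p, l in zip(probs, lengths))


probs = [0.4, 0.3, 0.2, 0.1]  # hypothetical example distribution
lengths = huffman_lengths(probs)
L_H = average_depth(probs, lengths)
H = entropy(probs)
print(lengths, L_H, H)
assert H <= L_H < H + 1
```

For this distribution the codeword lengths are 1, 2, 3, 3, so the average depth is 1.9 questions against an entropy of roughly 1.85 bits.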

However, upon further thought, there is a disparity between 
Huffman coding and how Twenty Questions games are played. 
Namely, real-world Twenty Questions games always terminate with 
the questioner pinpointing a specific object (e.g., "Is it a tank?" [1]), 
to which the answerer replies, "Yes!" In terms of source coding, this 
is equivalent to enforcing what we call the terminating yes constraint: 
all codewords must terminate with "1". Yet Huffman codes do not 
satisfy this constraint! In short, Huffman trees determine X, but do 
not specify X. 

In this paper, we first provide an example showing that simply 
appending branches to a Huffman tree may not produce the optimal 
Twenty Questions tree. We then prove that even under the terminating 
yes constraint, the average number of questions lies strictly between 
H(X) and H(X) + 1. 

II. Bar Bet: Guessing One of Four Objects 

Since Huffman coding solves Twenty Questions without a terminating 
yes, a natural idea is to first produce the Huffman tree, and then 
append branches to it so the terminating yes constraint is satisfied. 
Call the result an augmented Huffman tree. In the following example, 
we show that augmented Huffman trees may not be optimal Twenty 
Questions trees. 

Suppose there are only four objects the answerer could be thinking 
of. Denote them by $x_1, x_2, x_3, x_4$, with corresponding probabilities 
$p_1 \ge p_2 \ge p_3 \ge p_4$. Figure 1 shows the only two four-leaf 
questioning trees possible up to graph isomorphism, where the 
dashed edges have been added to accommodate the terminating yes 
constraint. Although there are many possible assignments of objects 
to leaves, the assignments shown in Figure 1 are the only reasonable 
candidates: they place higher probability objects at shallower depths. 

Fig. 1. Four-leaf trees. (a) Unary code. (b) Balanced code. 

One naturally imagines that the choice of a questioning tree should 
depend on the probability distribution. For instance, if the probabilities 
are close to uniform, we would guess that the balanced tree is 
better. However, if we let $Q_1$ and $Q_2$ denote the average number of 
questions used by the unary and balanced trees, respectively, then 

$$Q_1 = p_1 + 2p_2 + 3p_3 + 4p_4 = 1 + p_2 + 2p_3 + 3p_4,$$
$$Q_2 = 2(p_1 + p_2) + 3p_3 + 3p_4 = 2 + p_3 + p_4,$$

and the difference is 

$$\begin{aligned}
Q_2 - Q_1 &= 2 + p_3 + p_4 - 1 - p_2 - 2p_3 - 3p_4 \\
&= 1 - (p_2 + p_3 + 2p_4) \\
&= p_1 - p_4 \ge 0,
\end{aligned}$$

with equality if and only if the distribution is uniform. Apparently the 
unary tree dominates the balanced tree, regardless of the probabilities! 
We think this makes for a good bar bet. 
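
The bar bet can be tested numerically. The Python sketch below evaluates both averages on random distributions and checks that the balanced tree never beats the unary tree. The function names and the random test are illustrative assumptions. The depths used, (1, 2, 3, 4) for the unary tree and (2, 2, 3, 3) for the balanced tree, are those of the two trees once the terminating yes branches are included.

```python
import random


def unary_questions(p):
    """Average questions for the unary tree with terminating yes (depths 1, 2, 3, 4)."""
    p1, p2, p3, p4 = p
    return p1 + 2 * p2 + 3 * p3 + 4 * p4


def balanced_questions(p):
    """Average questions for the balanced tree with terminating yes (depths 2, 2, 3, 3)."""
    p1, p2, p3, p4 = p
    return 2 * (p1 + p2) + 3 * p3 + 3 * p4


random.seed(0)
for _ in range(1000):
    # Draw a random distribution and sort it so that p1 >= p2 >= p3 >= p4.
    raw = sorted((random.random() for _ in range(4)), reverse=True)
    total = sum(raw)
    p = [r / total for r in raw]
    gap = balanced_questions(p) - unary_questions(p)
    # The gap should equal p1 - p4, which is never negative.
    assert abs(gap - (p[0] - p[3])) < 1e-12
    assert gap >= -1e-12
```

For the uniform distribution both functions return 2.5, which is the only case in which the bet is a draw.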

This example demonstrates that augmenting a Huffman tree does 
not necessarily produce the optimal Twenty Questions tree. For 
example, if the probabilities were (3/10, 3/10, 2/10, 2/10), then the 
resulting augmented Huffman tree would yield the balanced tree, 
although the unary tree is better. In fact, among all distributions 
for which the Huffman algorithm produces a balanced tree, the 
maximum difference in the average number of questions required by 
the balanced and unary trees approaches 1/3, a value attained in the limit 
$\epsilon \to 0$ by the distribution $(1/3 - \epsilon, 1/3 - \epsilon, 1/3 - \epsilon, 3\epsilon)$. 
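
Both claims in this paragraph can be reproduced with a short Python sketch. It runs the Huffman algorithm on the example distribution and on members of the limiting family, then reports the gap between the balanced and unary trees. The helper names and the chosen values of epsilon are assumptions made for illustration.

```python
import heapq


def huffman_lengths(probs):
    """Return the Huffman codeword length of each symbol (assumes at least two symbols)."""
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)
    while len(heap) > 1:
        p_a, _, syms_a = heapq.heappop(heap)
        p_b, _, syms_b = heapq.heappop(heap)
        for s in syms_a + syms_b:
            lengths[s] += 1  # every symbol under the merge moves one level down
        heapq.heappush(heap, (p_a + p_b, counter, syms_a + syms_b))
        counter += 1
    return lengths


def gap(p):
    """Balanced minus unary average questions, both with terminating yes."""
    unary = p[0] + 2 * p[1] + 3 * p[2] + 4 * p[3]
    balanced = 2 * (p[0] + p[1]) + 3 * (p[2] + p[3])
    return balanced - unary


example = [0.3, 0.3, 0.2, 0.2]
print(huffman_lengths(example), gap(example))

for eps in (0.05, 0.01, 0.001):  # hypothetical values, each below 1/12
    p = [1 / 3 - eps] * 3 + [3 * eps]
    print(eps, huffman_lengths(p), gap(p))
```

In each case the Huffman lengths are all 2, so the Huffman tree is balanced, while the gap grows toward 1/3 as epsilon shrinks.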

III. Entropy Bounds On The Average Number of 
Questions 

Let $L_H$ be the average depth of the Huffman tree, and let $L_{\text{yes}}$ 
be the average depth of the optimal Twenty Questions tree. In this 
section, we prove 

$$H(X) < L_{\text{yes}} < H(X) + 1.$$

Note that these are the same bounds satisfied by $L_H$, except for the 
strict inequality in the lower bound. We first require two lemmas. 

Lemma 3.1 (Half-Bit Lemma): A binary tree that does not satisfy 
the terminating yes constraint can be modified to satisfy it while 
adding no more than 1/2 to the average depth. 

Proof: Let T be a tree that does not satisfy the terminating yes 
constraint. By appending a branch to all leaves whose codewords end 
with 0, we can construct an augmented tree T' that does satisfy it. 
(This forces all leaves to sway in the same direction.) To minimize 
the increase in average depth, interchange siblings in T as necessary 
so that the lower probability sibling is always the one that receives 
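the appended branch. (A leaf whose sibling is an internal node can always 
be placed on the 1 side, so it needs no branch.) Each leaf that receives a 
branch then has a sibling leaf of at least equal probability, so the 
extended leaves carry total probability at most 1/2. Consequently, if the 
average length of T is L, the average length of T' will be no more than 
L + 1/2. ■ 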
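
The construction in this proof can be sketched in Python. The code builds a Huffman tree, then computes the probability mass of the leaves that must be extended after siblings are interchanged. The tree representation (a leaf is a float, an internal node is a pair) and all function names are assumptions made for illustration, and the sketch assumes at least two objects.

```python
import heapq
import random


def huffman_tree(probs):
    """Build a Huffman tree; a leaf is a float, an internal node is a (child, child) tuple."""
    heap = [(p, i, p) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    counter = len(probs)
    while len(heap) > 1:
        p_a, _, a = heapq.heappop(heap)
        p_b, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (p_a + p_b, counter, (a, b)))
        counter += 1
    return heap[0][2]


def average_depth(tree, depth=0):
    """Probability-weighted depth of the leaves."""
    if isinstance(tree, tuple):
        return sum(average_depth(child, depth + 1) for child in tree)
    return tree * depth


def extension_cost(tree):
    """Probability mass of the leaves that receive an appended branch."""
    if not isinstance(tree, tuple):
        return 0.0
    a, b = tree
    if not isinstance(a, tuple) and not isinstance(b, tuple):
        # Two sibling leaves: the less probable one takes the "No" side.
        return min(a, b)
    # A leaf next to a subtree sits on the "Yes" side and needs no extra branch.
    return extension_cost(a) + extension_cost(b)


random.seed(1)
for _ in range(500):
    n = random.randint(2, 8)
    raw = [random.random() for _ in range(n)]
    probs = [r / sum(raw) for r in raw]
    # The augmented tree has average depth average_depth + extension_cost.
    assert extension_cost(huffman_tree(probs)) <= 0.5 + 1e-12
```

For the distribution (0.4, 0.3, 0.2, 0.1) the Huffman tree has average depth 1.9 and the extension costs 0.1, giving an augmented tree of average depth 2.0.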

Lemma 3.2 (Gallager's Redundancy Bound): For all finite distributions, 
$L_H - H(X) \le p_1 + \sigma$, where $p_1$ is the largest probability, 
and $\sigma := 1 - \log_2 e + \log_2(\log_2 e) \approx 0.086$. 

Proof: See Gallager [3]. ■ 
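
The constant in Gallager's bound, and the bound itself, can be checked on random distributions. In this Python sketch the Huffman average length is computed as the sum of the merged probabilities, since each merge adds one level to every leaf beneath it. The names `SIGMA` and `huffman_average_length` and the random test are assumptions made for illustration.

```python
import heapq
import math
import random

# Constant from Lemma 3.2: 1 - log2(e) + log2(log2(e)).
SIGMA = 1 - math.log2(math.e) + math.log2(math.log2(math.e))


def entropy(probs):
    """Shannon entropy H(X) in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def huffman_average_length(probs):
    """Average Huffman codeword length, as the sum of all merged probabilities."""
    heap = list(probs)
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged  # each merge adds one level to every leaf below it
        heapq.heappush(heap, merged)
    return total


print(SIGMA)

random.seed(2)
for _ in range(500):
    n = random.randint(2, 10)
    raw = [random.random() for _ in range(n)]
    probs = [r / sum(raw) for r in raw]
    redundancy = huffman_average_length(probs) - entropy(probs)
    assert redundancy <= max(probs) + SIGMA + 1e-9
```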

Theorem 3.3: $H(X) < L_{\text{yes}} < H(X) + 1$. 

Proof: We first establish the lower bound. By pruning the 
appended branches of the optimal Twenty Questions tree, we have a 
new tree of reduced average depth in which every internal node has 
two children. Amongst all such trees, the Huffman tree has lowest 
average depth, so $L_H < L_{\text{yes}}$. Lastly, $H(X) \le L_H$ (see Cover and 
Thomas [2]). 

For the upper bound, we consider two cases. First, suppose $p_1 < 0.4$. 
From Lemma 3.2, 

$$L_H - H(X) \le p_1 + \sigma < 1/2.$$

Adding 1/2 to both sides and rearranging, 

$$L_H + 1/2 < H(X) + 1.$$

From Lemma 3.1, $L_{\text{yes}} \le L_H + 1/2$. Thus, 

$$L_{\text{yes}} < H(X) + 1.$$

When $p_1 \ge 0.4$, we prove the upper bound by induction on the 
number of objects. Let X be a random variable taking n possible 
values, and let T be the tree with minimum average depth under 
both the terminating yes constraint and the additional constraint that 
the most probable object has a codeword of length one. This tree T 
is illustrated in Figure 2. While this additional constraint may result 
in a suboptimal tree, we will show that T satisfies the desired upper 
bound regardless, and thus the optimal Twenty Questions tree does 
also. 

Fig. 2. Tree T used in the induction argument when $p_1 \ge 0.4$. 

Let L denote the average depth of T, and let $T_2$ denote the right 
subtree containing $n - 1$ leaves. Then 

$$L = 1 + (1 - p_1) L(T_2),$$

where $L(T_2)$ denotes the average depth of $T_2$. Also, by the grouping 
law for entropy, 

$$H(X) = H(p_1) + (1 - p_1) H(X_2),$$

where $X_2$ is the random variable given by the remaining $n - 1$ 
normalized probabilities $\left( \frac{p_2}{1 - p_1}, \frac{p_3}{1 - p_1}, \ldots, \frac{p_n}{1 - p_1} \right)$. Subtracting these 
equations, 

$$L - H(X) = 1 - H(p_1) + (1 - p_1)\,(L(T_2) - H(X_2)).$$

By construction of T, it follows that $T_2$ is an optimal Twenty Questions 
tree for $X_2$. By the induction hypothesis, $L(T_2) - H(X_2) < 1$, 
and thus 

$$L - H(X) < 2 - (H(p_1) + p_1).$$



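Since $p_1$ could be any value in [0.4, 1], we want the largest value of this 
upper bound over that interval. The function $H(p_1) + p_1$, where $H(p_1)$ is 
the binary entropy, is concave, so on [0.4, 1] it is smallest at an endpoint: 
it is about 1.371 at $p_1 = 0.4$ and exactly 1 at $p_1 = 1$. Setting $p_1 = 1$, 

$$L - H(X) < 1.$$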
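
The choice of endpoint can be confirmed with a simple grid scan. The Python sketch below evaluates 2 - (H(p) + p), with H the binary entropy, on a grid over [0.4, 1] and reports where it is largest. The grid resolution and function names are assumptions made for illustration.

```python
import math


def binary_entropy(p):
    """Binary entropy in bits, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def bound(p1):
    """Right-hand side of the induction step: 2 - (H(p1) + p1)."""
    return 2 - (binary_entropy(p1) + p1)


# Grid from 0.4 to 1.0 built from integers so the endpoints are exact.
grid = [(4000 + 6 * k) / 10000 for k in range(1001)]
worst = max(grid, key=bound)
print(worst, bound(worst))
assert all(bound(p) <= 1 + 1e-12 for p in grid)
```

On this grid the largest value is 1, at the right endpoint, while the value at 0.4 is roughly 0.629.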

Thus, 

$$L_{\text{yes}} \le L < H(X) + 1.$$

■ 
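
For a small number of objects, the theorem can be checked by brute force. The Python sketch below is not the method of this paper: it is an exhaustive search over all subset questions, written under the assumption that a single object on the "Yes" side is identified at once, while a single object on the "No" side still needs one final question. All names are hypothetical, and the search takes time exponential in the number of objects.

```python
import math
import random
from functools import lru_cache


def entropy(probs):
    """Shannon entropy H(X) in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def optimal_yes_depth(probs):
    """Minimum average number of questions when every game ends with "Yes"."""
    n = len(probs)
    # weight[mask] is the total probability of the objects in the bitmask.
    weight = [0.0] * (1 << n)
    for mask in range(1, 1 << n):
        low = mask & -mask
        weight[mask] = weight[mask ^ low] + probs[low.bit_length() - 1]

    @lru_cache(maxsize=None)
    def cost(mask):
        # Weighted number of questions still needed for the objects in mask.
        if mask & (mask - 1) == 0:
            return weight[mask]  # one object left: ask "Is it x?" and hear "Yes"
        best = math.inf
        yes = (mask - 1) & mask  # enumerate proper non-empty subsets of mask
        while yes:
            no = mask ^ yes
            # A single object on the "Yes" side is already identified.
            yes_cost = 0.0 if yes & (yes - 1) == 0 else cost(yes)
            best = min(best, yes_cost + cost(no))
            yes = (yes - 1) & mask
        return weight[mask] + best

    return cost((1 << n) - 1)


print(optimal_yes_depth([0.3, 0.3, 0.2, 0.2]))

random.seed(3)
for _ in range(100):
    n = random.randint(2, 6)
    raw = [random.random() for _ in range(n)]
    probs = [r / sum(raw) for r in raw]
    H = entropy(probs)
    assert H < optimal_yes_depth(probs) < H + 1
```

For the distribution (3/10, 3/10, 2/10, 2/10) the search returns about 2.3, the value of the unary tree, rather than the 2.4 of the augmented Huffman tree.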

Lastly, by comparing the bounds 

$$H(X) \le L_H < H(X) + 1 \tag{1}$$

$$H(X) < L_{\text{yes}} < H(X) + 1 \tag{2}$$

we conclude that 

$$H(X) \le L_H < L_{\text{yes}} < H(X) + 1. \tag{3}$$

Since the classical bounds in Equation (1) are tight, it follows that the 
bounds in Theorem 3.3 are also tight. 

IV. Conclusion 

We have provided resolution to a disconnect between the Twenty 
Questions game and Huffman coding. Although Twenty Questions 
games always end with "Yes", thankfully the average number of 
questions they require is still within one of the entropy - a nice 
answer to a simple problem. As Forrest Gump would say, "One less 
thing to worry about." 